{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Cleaning the master metadata\n",
    "\n",
    "#### standardizing authors / titles\n",
    "\n",
    "The authors and titles I got from HathiTrust have a few rough edges. Titles sometimes include a statement about authorship preceded by ```$c```. I don't usually want to treat that as part of the title.\n",
    "\n",
    "Authors' names may be preceded by \"Sir\" or \"Mrs\"; generally I want to move that sort of honorific to the end of the name, so that last name always comes first. (Important for deduplication.)\n",
    "\n",
    "#### volume-part inference\n",
    "\n",
    "Commonly, a multi-volume set of *Works* will have a \"contents\" statement that enumerates the sub-title of each volume. With a bit of careful parsing, we can assign titles to individual volumes, so we have *Ivanhoe* instead of the less informative *Works of Scott,* vol 7.\n",
    "\n",
    "#### date correction\n",
    "\n",
    "The routine I used to infer ```inferreddate``` gave up a little too easily in some cases, and there are zeroes where we could make a better guess. Also, I'd like to add a column for \"last possible date of composition.\" Using information about an author's date of death (!!), or in some cases copyright date, we can infer that some volumes are reprints of much earlier publications.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# a few useful imports\n",
    "\n",
    "import pandas as pd\n",
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>author</th>\n",
       "      <th>authordate</th>\n",
       "      <th>contents</th>\n",
       "      <th>datetype</th>\n",
       "      <th>enddate</th>\n",
       "      <th>enumcron</th>\n",
       "      <th>genres</th>\n",
       "      <th>geographics</th>\n",
       "      <th>imprint</th>\n",
       "      <th>imprintdate</th>\n",
       "      <th>inferreddate</th>\n",
       "      <th>locnum</th>\n",
       "      <th>oclc</th>\n",
       "      <th>place</th>\n",
       "      <th>recordid</th>\n",
       "      <th>startdate</th>\n",
       "      <th>subjects</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>docid</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>njp.32101071963472</th>\n",
       "      <td>Rousseau, Jean-Jacques</td>\n",
       "      <td>1712-1778.</td>\n",
       "      <td>NaN</td>\n",
       "      <td>s</td>\n",
       "      <td></td>\n",
       "      <td>vol. 2</td>\n",
       "      <td>NotFiction</td>\n",
       "      <td>NaN</td>\n",
       "      <td>London;Printed for G.G.J. and J. Robinson, and...</td>\n",
       "      <td>1790</td>\n",
       "      <td>1790</td>\n",
       "      <td>NaN</td>\n",
       "      <td>16894767.0</td>\n",
       "      <td>enk</td>\n",
       "      <td>8980647</td>\n",
       "      <td>1790</td>\n",
       "      <td>Rousseau, Jean-Jacques|1712-1778|Correspondence</td>\n",
       "      <td>The confessions of J.J. Rousseau, citizen of G...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nnc1.0037106139</th>\n",
       "      <td>Savage, Richard</td>\n",
       "      <td>1846-1903.</td>\n",
       "      <td>Copyright ed. ...</td>\n",
       "      <td>s</td>\n",
       "      <td></td>\n",
       "      <td>v.1</td>\n",
       "      <td>Fiction</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Leipzig;Tauchnitz;1899.</td>\n",
       "      <td>1899</td>\n",
       "      <td>1899</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35179607.0</td>\n",
       "      <td>gw</td>\n",
       "      <td>8398383</td>\n",
       "      <td>1899</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The white lady of Khaminavatka; | a story of t...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dul1.ark+=13960=t3nw07208</th>\n",
       "      <td>Riddell, J. H</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>s</td>\n",
       "      <td></td>\n",
       "      <td>v.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>London;Tinsley Brothers;1866.</td>\n",
       "      <td>1866</td>\n",
       "      <td>1866</td>\n",
       "      <td>PR5227.R36R33 1866</td>\n",
       "      <td>2753964.0</td>\n",
       "      <td>enk</td>\n",
       "      <td>10945362</td>\n",
       "      <td>1866</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The race for wealth</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nyp.33433074869615</th>\n",
       "      <td>Irving, Washington</td>\n",
       "      <td>1783-1859.</td>\n",
       "      <td>Surrey ed.</td>\n",
       "      <td>s</td>\n",
       "      <td></td>\n",
       "      <td>v. 2</td>\n",
       "      <td>Fiction</td>\n",
       "      <td>NaN</td>\n",
       "      <td>New York;G. P. Putnam;1896.</td>\n",
       "      <td>1896</td>\n",
       "      <td>1896</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8182806.0</td>\n",
       "      <td>nyu</td>\n",
       "      <td>8665326</td>\n",
       "      <td>1896</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bracebridge hall; or, The humourists.</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nyp.33433068271737</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>New ed.</td>\n",
       "      <td>s</td>\n",
       "      <td></td>\n",
       "      <td>NaN</td>\n",
       "      <td>NotFiction</td>\n",
       "      <td>NaN</td>\n",
       "      <td>London;F.C. &amp; J. Rivington;1810.</td>\n",
       "      <td>1810</td>\n",
       "      <td>1810</td>\n",
       "      <td>NaN</td>\n",
       "      <td>38289890.0</td>\n",
       "      <td>enk</td>\n",
       "      <td>8627815</td>\n",
       "      <td>1810</td>\n",
       "      <td>Religious aspects|Anecdotes|Tracts</td>\n",
       "      <td>Cheap repository tracts: entertaining, moral, ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           author  authordate  \\\n",
       "docid                                                           \n",
       "njp.32101071963472         Rousseau, Jean-Jacques  1712-1778.   \n",
       "nnc1.0037106139                   Savage, Richard  1846-1903.   \n",
       "dul1.ark+=13960=t3nw07208           Riddell, J. H         NaN   \n",
       "nyp.33433074869615             Irving, Washington  1783-1859.   \n",
       "nyp.33433068271737                            NaN         NaN   \n",
       "\n",
       "                                    contents datetype enddate enumcron  \\\n",
       "docid                                                                    \n",
       "njp.32101071963472                       NaN        s           vol. 2   \n",
       "nnc1.0037106139            Copyright ed. ...        s              v.1   \n",
       "dul1.ark+=13960=t3nw07208                NaN        s              v.1   \n",
       "nyp.33433074869615                Surrey ed.        s             v. 2   \n",
       "nyp.33433068271737                   New ed.        s              NaN   \n",
       "\n",
       "                               genres geographics  \\\n",
       "docid                                               \n",
       "njp.32101071963472         NotFiction         NaN   \n",
       "nnc1.0037106139               Fiction         NaN   \n",
       "dul1.ark+=13960=t3nw07208         NaN         NaN   \n",
       "nyp.33433074869615            Fiction         NaN   \n",
       "nyp.33433068271737         NotFiction         NaN   \n",
       "\n",
       "                                                                     imprint  \\\n",
       "docid                                                                          \n",
       "njp.32101071963472         London;Printed for G.G.J. and J. Robinson, and...   \n",
       "nnc1.0037106139                                      Leipzig;Tauchnitz;1899.   \n",
       "dul1.ark+=13960=t3nw07208                      London;Tinsley Brothers;1866.   \n",
       "nyp.33433074869615                               New York;G. P. Putnam;1896.   \n",
       "nyp.33433068271737                          London;F.C. & J. Rivington;1810.   \n",
       "\n",
       "                          imprintdate  inferreddate              locnum  \\\n",
       "docid                                                                     \n",
       "njp.32101071963472               1790          1790                 NaN   \n",
       "nnc1.0037106139                  1899          1899                 NaN   \n",
       "dul1.ark+=13960=t3nw07208        1866          1866  PR5227.R36R33 1866   \n",
       "nyp.33433074869615               1896          1896                 NaN   \n",
       "nyp.33433068271737               1810          1810                 NaN   \n",
       "\n",
       "                                 oclc place  recordid startdate  \\\n",
       "docid                                                             \n",
       "njp.32101071963472         16894767.0   enk   8980647      1790   \n",
       "nnc1.0037106139            35179607.0   gw    8398383      1899   \n",
       "dul1.ark+=13960=t3nw07208   2753964.0   enk  10945362      1866   \n",
       "nyp.33433074869615          8182806.0   nyu   8665326      1896   \n",
       "nyp.33433068271737         38289890.0   enk   8627815      1810   \n",
       "\n",
       "                                                                  subjects  \\\n",
       "docid                                                                        \n",
       "njp.32101071963472         Rousseau, Jean-Jacques|1712-1778|Correspondence   \n",
       "nnc1.0037106139                                                        NaN   \n",
       "dul1.ark+=13960=t3nw07208                                              NaN   \n",
       "nyp.33433074869615                                                     NaN   \n",
       "nyp.33433068271737                      Religious aspects|Anecdotes|Tracts   \n",
       "\n",
       "                                                                       title  \n",
       "docid                                                                         \n",
       "njp.32101071963472         The confessions of J.J. Rousseau, citizen of G...  \n",
       "nnc1.0037106139            The white lady of Khaminavatka; | a story of t...  \n",
       "dul1.ark+=13960=t3nw07208                                The race for wealth  \n",
       "nyp.33433074869615                     Bracebridge hall; or, The humourists.  \n",
       "nyp.33433068271737         Cheap repository tracts: entertaining, moral, ...  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# read the raw data\n",
    "\n",
    "meta = pd.read_csv('mergedficmetadata.tsv', sep = '\\t', index_col = 'docid', low_memory = False)\n",
    "meta.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's create some new columns. Two of them will be blank.\n",
    "# One will contain just volume numbers. To that end, let's\n",
    "# define a function that translates enumcrons to vol\n",
    "# numbers.\n",
    "\n",
    "def justvolnumbers(enum):\n",
    "    \n",
    "    ''' Returns strictly the numeric part of an enumcron,\n",
    "    getting rid of the nonstandard 'v. ' or 'V.' It returns\n",
    "    an empty string for enums like 'c. 2' or 'copy 2' (those\n",
    "    are copy numbers, not volume numbers) and for numbers\n",
    "    outside the range 1-199.\n",
    "    '''\n",
    "    \n",
    "    if pd.isnull(enum) or len(enum) < 1:\n",
    "        return ''\n",
    "    elif enum.startswith(('c', '(c')):\n",
    "        return ''\n",
    "    else:\n",
    "        # raw string, so that \\d is not read as a string escape\n",
    "        matches = re.findall(r'\\d+', enum)\n",
    "        if len(matches) < 1:\n",
    "            return ''\n",
    "        else:\n",
    "            volnum = int(matches[0])\n",
    "            if volnum < 200 and volnum > 0:\n",
    "                return volnum\n",
    "            else:\n",
    "                return ''\n",
    "\n",
    "meta['volnum'] = meta['enumcron'].map(justvolnumbers)\n",
    "meta['shorttitle'] = ''\n",
    "meta['parttitle'] = ''"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>author</th>\n",
       "      <th>authordate</th>\n",
       "      <th>contents</th>\n",
       "      <th>datetype</th>\n",
       "      <th>enddate</th>\n",
       "      <th>enumcron</th>\n",
       "      <th>genres</th>\n",
       "      <th>geographics</th>\n",
       "      <th>imprint</th>\n",
       "      <th>imprintdate</th>\n",
       "      <th>...</th>\n",
       "      <th>locnum</th>\n",
       "      <th>oclc</th>\n",
       "      <th>place</th>\n",
       "      <th>recordid</th>\n",
       "      <th>startdate</th>\n",
       "      <th>subjects</th>\n",
       "      <th>title</th>\n",
       "      <th>volnum</th>\n",
       "      <th>shorttitle</th>\n",
       "      <th>parttitle</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>docid</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>njp.32101071963472</th>\n",
       "      <td>Rousseau, Jean-Jacques</td>\n",
       "      <td>1712-1778.</td>\n",
       "      <td>NaN</td>\n",
       "      <td>s</td>\n",
       "      <td></td>\n",
       "      <td>vol. 2</td>\n",
       "      <td>NotFiction</td>\n",
       "      <td>NaN</td>\n",
       "      <td>London;Printed for G.G.J. and J. Robinson, and...</td>\n",
       "      <td>1790</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>16894767.0</td>\n",
       "      <td>enk</td>\n",
       "      <td>8980647</td>\n",
       "      <td>1790</td>\n",
       "      <td>Rousseau, Jean-Jacques|1712-1778|Correspondence</td>\n",
       "      <td>The confessions of J.J. Rousseau, citizen of G...</td>\n",
       "      <td>2</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nnc1.0037106139</th>\n",
       "      <td>Savage, Richard</td>\n",
       "      <td>1846-1903.</td>\n",
       "      <td>Copyright ed. ...</td>\n",
       "      <td>s</td>\n",
       "      <td></td>\n",
       "      <td>v.1</td>\n",
       "      <td>Fiction</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Leipzig;Tauchnitz;1899.</td>\n",
       "      <td>1899</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>35179607.0</td>\n",
       "      <td>gw</td>\n",
       "      <td>8398383</td>\n",
       "      <td>1899</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The white lady of Khaminavatka; | a story of t...</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dul1.ark+=13960=t3nw07208</th>\n",
       "      <td>Riddell, J. H</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>s</td>\n",
       "      <td></td>\n",
       "      <td>v.1</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>London;Tinsley Brothers;1866.</td>\n",
       "      <td>1866</td>\n",
       "      <td>...</td>\n",
       "      <td>PR5227.R36R33 1866</td>\n",
       "      <td>2753964.0</td>\n",
       "      <td>enk</td>\n",
       "      <td>10945362</td>\n",
       "      <td>1866</td>\n",
       "      <td>NaN</td>\n",
       "      <td>The race for wealth</td>\n",
       "      <td>1</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nyp.33433074869615</th>\n",
       "      <td>Irving, Washington</td>\n",
       "      <td>1783-1859.</td>\n",
       "      <td>Surrey ed.</td>\n",
       "      <td>s</td>\n",
       "      <td></td>\n",
       "      <td>v. 2</td>\n",
       "      <td>Fiction</td>\n",
       "      <td>NaN</td>\n",
       "      <td>New York;G. P. Putnam;1896.</td>\n",
       "      <td>1896</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>8182806.0</td>\n",
       "      <td>nyu</td>\n",
       "      <td>8665326</td>\n",
       "      <td>1896</td>\n",
       "      <td>NaN</td>\n",
       "      <td>Bracebridge hall; or, The humourists.</td>\n",
       "      <td>2</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>nyp.33433068271737</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>New ed.</td>\n",
       "      <td>s</td>\n",
       "      <td></td>\n",
       "      <td>NaN</td>\n",
       "      <td>NotFiction</td>\n",
       "      <td>NaN</td>\n",
       "      <td>London;F.C. &amp; J. Rivington;1810.</td>\n",
       "      <td>1810</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>38289890.0</td>\n",
       "      <td>enk</td>\n",
       "      <td>8627815</td>\n",
       "      <td>1810</td>\n",
       "      <td>Religious aspects|Anecdotes|Tracts</td>\n",
       "      <td>Cheap repository tracts: entertaining, moral, ...</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 21 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                           author  authordate  \\\n",
       "docid                                                           \n",
       "njp.32101071963472         Rousseau, Jean-Jacques  1712-1778.   \n",
       "nnc1.0037106139                   Savage, Richard  1846-1903.   \n",
       "dul1.ark+=13960=t3nw07208           Riddell, J. H         NaN   \n",
       "nyp.33433074869615             Irving, Washington  1783-1859.   \n",
       "nyp.33433068271737                            NaN         NaN   \n",
       "\n",
       "                                    contents datetype enddate enumcron  \\\n",
       "docid                                                                    \n",
       "njp.32101071963472                       NaN        s           vol. 2   \n",
       "nnc1.0037106139            Copyright ed. ...        s              v.1   \n",
       "dul1.ark+=13960=t3nw07208                NaN        s              v.1   \n",
       "nyp.33433074869615                Surrey ed.        s             v. 2   \n",
       "nyp.33433068271737                   New ed.        s              NaN   \n",
       "\n",
       "                               genres geographics  \\\n",
       "docid                                               \n",
       "njp.32101071963472         NotFiction         NaN   \n",
       "nnc1.0037106139               Fiction         NaN   \n",
       "dul1.ark+=13960=t3nw07208         NaN         NaN   \n",
       "nyp.33433074869615            Fiction         NaN   \n",
       "nyp.33433068271737         NotFiction         NaN   \n",
       "\n",
       "                                                                     imprint  \\\n",
       "docid                                                                          \n",
       "njp.32101071963472         London;Printed for G.G.J. and J. Robinson, and...   \n",
       "nnc1.0037106139                                      Leipzig;Tauchnitz;1899.   \n",
       "dul1.ark+=13960=t3nw07208                      London;Tinsley Brothers;1866.   \n",
       "nyp.33433074869615                               New York;G. P. Putnam;1896.   \n",
       "nyp.33433068271737                          London;F.C. & J. Rivington;1810.   \n",
       "\n",
       "                          imprintdate    ...                 locnum  \\\n",
       "docid                                    ...                          \n",
       "njp.32101071963472               1790    ...                    NaN   \n",
       "nnc1.0037106139                  1899    ...                    NaN   \n",
       "dul1.ark+=13960=t3nw07208        1866    ...     PR5227.R36R33 1866   \n",
       "nyp.33433074869615               1896    ...                    NaN   \n",
       "nyp.33433068271737               1810    ...                    NaN   \n",
       "\n",
       "                                 oclc place  recordid  startdate  \\\n",
       "docid                                                              \n",
       "njp.32101071963472         16894767.0   enk   8980647       1790   \n",
       "nnc1.0037106139            35179607.0   gw    8398383       1899   \n",
       "dul1.ark+=13960=t3nw07208   2753964.0   enk  10945362       1866   \n",
       "nyp.33433074869615          8182806.0   nyu   8665326       1896   \n",
       "nyp.33433068271737         38289890.0   enk   8627815       1810   \n",
       "\n",
       "                                                                  subjects  \\\n",
       "docid                                                                        \n",
       "njp.32101071963472         Rousseau, Jean-Jacques|1712-1778|Correspondence   \n",
       "nnc1.0037106139                                                        NaN   \n",
       "dul1.ark+=13960=t3nw07208                                              NaN   \n",
       "nyp.33433074869615                                                     NaN   \n",
       "nyp.33433068271737                      Religious aspects|Anecdotes|Tracts   \n",
       "\n",
       "                                                                       title  \\\n",
       "docid                                                                          \n",
       "njp.32101071963472         The confessions of J.J. Rousseau, citizen of G...   \n",
       "nnc1.0037106139            The white lady of Khaminavatka; | a story of t...   \n",
       "dul1.ark+=13960=t3nw07208                                The race for wealth   \n",
       "nyp.33433074869615                     Bracebridge hall; or, The humourists.   \n",
       "nyp.33433068271737         Cheap repository tracts: entertaining, moral, ...   \n",
       "\n",
       "                          volnum shorttitle parttitle  \n",
       "docid                                                  \n",
       "njp.32101071963472             2                       \n",
       "nnc1.0037106139                1                       \n",
       "dul1.ark+=13960=t3nw07208      1                       \n",
       "nyp.33433074869615             2                       \n",
       "nyp.33433068271737                                     \n",
       "\n",
       "[5 rows x 21 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Just to test what we produced:\n",
    "\n",
    "meta.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### volume-part inference\n",
    "\n",
    "Basically, we want to convert a contents statement into a dictionary where volume numbers map to the titles of individual volumes, like so:\n",
    "\n",
    "![caption](files/parsed.png)\n",
    "\n",
    "That's not terribly hard, with a regex:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def volmap(contents):\n",
    "    \n",
    "    ''' A function that turns a \"contents\" statement into a dictionary\n",
    "    of titles.\n",
    "    '''\n",
    "    themap = dict()\n",
    "    if pd.isnull(contents):\n",
    "        return themap\n",
    "    if len(contents) < 4:\n",
    "        return themap\n",
    "    \n",
    "    # Translate Roman numerals I-XVI into Arabic numbers. Order matters:\n",
    "    # a numeral has to be replaced before any shorter numeral it ends with\n",
    "    # (e.g. 'IX.' before 'X.').\n",
    "    romans = [('XVI', '16'), ('XV', '15'), ('XIV', '14'), ('XIII', '13'),\n",
    "              ('XII', '12'), ('XI', '11'), ('IX', '9'), ('X', '10'),\n",
    "              ('VIII', '8'), ('VII', '7'), ('VI', '6'), ('IV', '4'),\n",
    "              ('V', '5'), ('III', '3'), ('II', '2'), ('I', '1')]\n",
    "    for roman, arabic in romans:\n",
    "        contents = contents.replace(roman + '.', arabic)\n",
    "    \n",
    "    sequence = re.findall(r'\\D+|\\d+', contents)\n",
    "    \n",
    "    # The regex above does most of the work in this function, translating the\n",
    "    # contents statement into a sequence of alternating alphabetic and numeric\n",
    "    # sections.\n",
    "    \n",
    "    if len(sequence) < 3:\n",
    "        return themap\n",
    "    \n",
    "    started = False\n",
    "    hyphen = False\n",
    "    \n",
    "    for s in sequence:\n",
    "        if s.isdigit() and not started:\n",
    "            started = True\n",
    "            nextvols = [int(s)]\n",
    "            expectation = int(s) + 1\n",
    "        elif not started:\n",
    "            pass\n",
    "        elif s == '-':\n",
    "            hyphen = True\n",
    "        elif s.isdigit() and hyphen:\n",
    "            hyphen = False\n",
    "            if int(s) >= expectation and len(nextvols) == 1:\n",
    "                # a range like \"3-5\": add the remaining volumes, without\n",
    "                # repeating the first one\n",
    "                nextvols.extend(range(nextvols[0] + 1, int(s) + 1))\n",
    "                expectation = int(s) + 1\n",
    "        elif s.isdigit():\n",
    "            if int(s) == expectation:\n",
    "                nextvols = [int(s)]\n",
    "                expectation = int(s) + 1\n",
    "            else:\n",
    "                pass\n",
    "        else:\n",
    "            for n in nextvols:\n",
    "                themap[n] = s.strip('., -v[]()')\n",
    "    \n",
    "    return themap"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We also need to clean up titles by removing everything after ```$c```, along with various extra punctuation characters."
   ]
  },
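  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def short_title(longtitle):\n",
    "    ''' Drops the statement of authorship that follows \"$c\" and\n",
    "    strips extra punctuation from the title.\n",
    "    '''\n",
    "    # a missing title would otherwise raise a TypeError below\n",
    "    if pd.isnull(longtitle):\n",
    "        return ''\n",
    "    \n",
    "    # split() returns the whole string when \"$c\" is absent\n",
    "    justtitle = longtitle.split(\"$c\")[0]\n",
    "    \n",
    "    shorttitle = justtitle.strip('| /.,').replace(' | ', ' ')\n",
    "    return shorttitle"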
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's actually do the work."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "grouped = meta.groupby('recordid')\n",
    "ctr = 0\n",
    "for record, group in grouped:\n",
    "    ctr += 1\n",
    "    if ctr % 100 == 1:\n",
    "        print(ctr)\n",
    "\n",
    "    # use the longest contents statement attached to this record\n",
    "    longest = ''\n",
    "    for cont in group.contents:\n",
    "        if not pd.isnull(cont) and len(cont) > len(longest):\n",
    "            longest = cont\n",
    "    themap = volmap(longest)\n",
    "\n",
    "    for idx in group.index:\n",
    "        volnum = group.loc[idx, 'volnum']\n",
    "        if isinstance(volnum, int) and volnum in themap:\n",
    "            meta.loc[idx, 'parttitle'] = themap[volnum]\n",
    "            meta.loc[idx, 'shorttitle'] = themap[volnum]\n",
    "        else:\n",
    "            meta.loc[idx, 'shorttitle'] = short_title(meta.loc[idx, 'title'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Author standardization\n",
    "\n",
    "Move those honorifics to the end of the name.\n",
    "\n",
    "Also, while we're at it, let's redress a couple of historical injustices that affect prominent authors in ways that would complicate deduplication."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "def flip_honorific(auth):\n",
    "    if pd.isnull(auth):\n",
    "        return ''\n",
    "    elif auth in ('Ward, Humphry, Mrs', 'Mrs. Humphry Ward', 'Mrs., Ward, Humphry', 'Ward, Humphry'):\n",
    "        return \"Ward, Mary Augusta\"\n",
    "    elif auth in ('Wood, Henry, Mrs', 'Mrs. Henry Wood', 'Mrs., Wood, Henry', 'Wood, Henry'):\n",
    "        return \"Wood, Ellen\"\n",
    "    \n",
    "    # yes, in principle that's unfair to the real Humphry Ward and Henry Wood\n",
    "    # however, in practice ...\n",
    "    \n",
    "    # the \\b keeps a surname that merely begins with these letters\n",
    "    # from being treated as an honorific\n",
    "    elif re.match(r'(Sir|Mrs)\\b', auth):\n",
    "        return auth[3: ].strip('. ,') + ', ' + auth[0:3]\n",
    "    elif re.match(r'Lady\\b', auth):\n",
    "        return auth[4: ].strip('. ,') + ', ' + auth[0:4]\n",
    "    elif auth.startswith('(') and ')' in auth:\n",
    "        parts = auth.split(')')\n",
    "        firstpart = parts[1].strip('., ')\n",
    "        name = firstpart + \" \" + parts[0] + \")\"\n",
    "        return name\n",
    "    elif auth == 'Baron, Dunsany, Edward John Moreton Drax Plunkett':\n",
    "        return 'Dunsany, Edward John Moreton Drax Plunkett'\n",
    "    elif auth == 'Baron, Lytton, Edward Bulwer Lytton':\n",
    "        return 'Lytton, Edward Bulwer Lytton'\n",
    "    elif auth == 'Baroness, Orczy, Emmuska Orczy':\n",
    "        return 'Orczy, Emmuska Orczy'\n",
    "    else:\n",
    "        return auth\n",
    "\n",
    "meta['cleanauth'] = meta['author'].map(flip_honorific)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Date correction\n",
    "\n",
    "Fixing a few inferred dates, adding a column for last possible date of composition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "meta['latestcomp'] = ''\n",
    "\n",
    "for idx in meta.index:\n",
    "    infer = meta.loc[idx, 'inferreddate']\n",
    "    if int(infer) == 0:\n",
    "        # fall back on startdate, then enddate, if either is plausible\n",
    "        for col in ('startdate', 'enddate'):\n",
    "            try:\n",
    "                newdate = int(meta.loc[idx, col])\n",
    "            except (ValueError, TypeError):\n",
    "                continue\n",
    "            if 1699 < newdate < 2100:\n",
    "                meta.loc[idx, 'inferreddate'] = newdate\n",
    "                break\n",
    "\n",
    "    authdate = meta.loc[idx, 'authordate']\n",
    "\n",
    "    # 3000 is a placeholder meaning \"no usable date\"\n",
    "    death = 3000\n",
    "    if not pd.isnull(authdate):\n",
    "        authdate = authdate.strip(',.')\n",
    "        if '-' in authdate and len(authdate) > 6:\n",
    "            try:\n",
    "                death = int(authdate[-4: ])\n",
    "            except ValueError:\n",
    "                death = 3000\n",
    "\n",
    "    datetype = meta.loc[idx, 'datetype']\n",
    "    firstpub = 3000\n",
    "    if datetype in ('c', 't', 'r'):\n",
    "        try:\n",
    "            firstpub = int(meta.loc[idx, 'enddate'])\n",
    "        except (ValueError, TypeError):\n",
    "            firstpub = 3000\n",
    "\n",
    "    infer = int(meta.loc[idx, 'inferreddate'])\n",
    "    if infer < 1700:\n",
    "        infer = 2100\n",
    "\n",
    "    if death < 1700:\n",
    "        death = 2100\n",
    "\n",
    "    if firstpub < 1700:\n",
    "        firstpub = 2100\n",
    "\n",
    "    meta.loc[idx, 'latestcomp'] = min(death, infer, firstpub)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### now write to file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "cols_in_order = ['author', 'cleanauth', 'authordate',  'inferreddate', 'latestcomp', 'datetype', 'startdate', 'enddate', 'imprint',\n",
    " 'imprintdate', 'contents', 'genres',  'subjects', 'geographics', 'locnum', 'oclc', 'place', 'recordid',\n",
    " 'enumcron', 'volnum', 'title', 'parttitle', 'shorttitle']\n",
    "# sort into a new frame rather than in place on a slice of meta,\n",
    "# which triggers a SettingWithCopyWarning\n",
    "outmeta = meta[cols_in_order].sort_values(by = ['inferreddate', 'recordid', 'volnum'])\n",
    "outmeta.to_csv('masterficmetadata.tsv', sep = '\\t')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
